Table of Contents

Background

The current html file is designed to demonstrate a set of functions developed for simulating genetic values and phenotypes in the context of multi-environment trials (MET) using multi-parent advanced generation intercrossed (MAGIC) populations. The functions are designed to follow the basic workflow:

Workflow

  1. Generate Gametes
  2. Generate Numeric Genotypes
  3. Assign individuals to environemnts
  4. Produce Genetic Values
  5. Simulate MET

Generate Gametes

The initial workflow for the simulation is to build gametes using a function called SimGametes. The function works for biparental and multiparent populations formed from the intercrossing of multiple founders. The result of the function is a matrix of individuals by markers (n x m) with each individual being given a potential parental state at each marker. The probability of observing a crossover at each region of the genome (i.e., marker) is equal to the given the recombination frequency. For each crossover, one of the remaining potential parental genotypes is randomly sampled. Double cross overs are not allowed at each region such that informative recombination is observed at each crossover. The function also returns a data.table with columns for Genotype,Chrom,cM_Frequency, and cm_Count to give the average recombination frequency and number of crossovers per chromosome for each individual.

Arguments

  1. ParentalNum: the number of parents the population is formed from
  2. cM: recombination frequency as a percentage (i.e., a cM of 0.30 would be a 0.30% recombination frequency)
  3. Chrom_Num: number of chromosomes (default is 10)
  4. Num_Geno: number of gametes or double haploids (DH) to produce (default is 500)
  5. Num_Markers: number of genomic regions to simulate (default is 10,000 with an equal number of markers assumed per chromosome)

Biparental Population

  • The below example shows how to use the function SimGametes in the context of a biparental population using a recombination frequency of 0.30 with the remaining parameters set to default.
source("~/Simulations/G2Fsimulation/1-Gamete_Generatation/SimulateGametes.R")

#Biparental Populations
BP_GenotypeList<-SimGametes(ParentalNum = 2,cM = 0.3,Chrom_Num = 10,Num_Geno = 500,Num_Markers = 1000)
  BP_Genotype<-BP_GenotypeList[[1]]
    fwrite(BP_Genotype,"Results/BiparentalGenotypes.csv")
#..Average number of X0 in BP
  BP_summary<-BP_GenotypeList[[2]]
    fwrite(BP_summary,"Results/BiparentalGenotypesSummary.csv")
BP_Genotype<-fread("Results/BiparentalGenotypes.csv")
#..Format the data
  melt_bp<-melt(BP_Genotype,id.vars = 'Genotype',measure.vars = 2:ncol(BP_Genotype))
melt_bp$chr<-as.numeric(gsub("M(.+)_.*", "\\1", melt_bp$variable))
  melt_bp$pos<-as.numeric(gsub(".*_", "", melt_bp$variable))
setnames(melt_bp,c('value'),c('Haplotype'))

#Aesthetics
themeGamete<-function(x) theme(plot.title=element_text(hjust=0.5,size=18,face="bold"),strip.text.y = element_text(size=17),strip.text.x = element_text(size=16), axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 3),plot.subtitle = element_text(face = "italic",size = 16),axis.title.x = element_text(size = 18),axis.title.y = element_text(size = 18))
cbPallete<-c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#DDCC77", "#117733", "#332288","#44AA99", "#999933", "#882255","#999999","#D55E00", "#CC79A7","#661100", "#6699CC", "#888888","#999999")

#Plot Biparental Population
ggplot(melt_bp[chr==1],aes(x=pos,y=Genotype))+geom_tile(aes(x=pos,y=Genotype,fill=Haplotype)) +theme_bw()+themeGamete()+
  ggtitle('Simulated Biparental Double Haploids',subtitle = 'Chromosome 1 Shown') + xlab('Position') + ylab("Genotype") +
    scale_fill_manual(values = cbPallete)+scale_x_continuous(expand = c(0, 0),breaks=seq(0,1000,100))
ggsave("Results/SimulatedDH_Biparental_XO.png",width = 10,height = 15)

The figure above shows chromosome 1 of the simulated double haploids (DHs) where the x-axis is the position (marker number) and y-axis represents each genotype. Color changes per row represent a crossover showing a change between the two possible parental states.

MAGIC Population

  • The below example simulates a MAGIC population formed from the intercrossing of 6 founders. The crossing scheme represents the crossing scheme used in the construction of the WI-SS-MAGIC population as described in Michel et al. (2022).
source("~/Simulations/G2Fsimulation/1-Gamete_Generatation/SimulateGametes.R")

#MAGIC Populations
MAGIC_GenotypeList<-SimGametes(ParentalNum = 6,cM = 0.3,Num_Markers = 1000) 
MAGIC_Genotype<-MAGIC_GenotypeList[[1]]
  fwrite(MAGIC_Genotype,"Results/MAGICparentalGenotypes.csv")
#..Average number of X0 in MAGIC
  MAGIC_summary<-MAGIC_GenotypeList[[2]]
    fwrite(MAGIC_summary,"Results/MAGICGenotypesSummary.csv")
MAGIC_Genotype<-fread("Results/MAGICparentalGenotypes.csv")
#..Format the data
  melt_magic<-melt(MAGIC_Genotype,id.vars = 'Genotype',measure.vars = 2:ncol(MAGIC_Genotype))
    melt_magic$chr<-as.numeric(gsub("M(.+)_.*", "\\1", melt_magic$variable))
melt_magic$pos<-as.numeric(gsub(".*_", "", melt_magic$variable))
  setnames(melt_magic,c('value'),c('Haplotype'))

#Aesthetics
themeGamete<-function(x) theme(plot.title=element_text(hjust=0.5,size=18,face="bold"),strip.text.y = element_text(size=17),strip.text.x = element_text(size=16), axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 3),plot.subtitle = element_text(face = "italic",size = 16),axis.title.x = element_text(size = 18),axis.title.y = element_text(size = 18))
cbPallete<-c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#DDCC77", "#117733", "#332288","#44AA99", "#999933", "#882255","#999999","#D55E00", "#CC79A7","#661100", "#6699CC", "#888888","#999999")

#Plot the MAGIC Lines
ggplot(melt_magic[chr==1],aes(x=pos,y=Genotype))+geom_tile(aes(x=pos,y=Genotype,fill=Haplotype))+theme_bw()+themeGamete()+ ggtitle('Simulated MAGIC Double Haploids',subtitle = 'Chromosome 1 Shown') + xlab('Position') + ylab("Genotype") +scale_fill_manual(values = cbPallete)+scale_x_continuous(expand = c(0, 0),breaks=seq(0,1000,100))
ggsave("Results/SimulatedDH_MAGIC_XO.png",width = 10,height = 15)

The figure above shows the simualted MAGIC gametes formed from six parents all intercrossed. Each change in color per row represents a swtich in the paretnal state present at each genomic region. Examples of simulated gametes are shown below:

include_graphics(c("Results/MAGIC_Geno114.png"))

include_graphics(c("Results/MAGIC_Geno225.png"))

include_graphics(c("Results/MAGIC_Geno437.png"))

The above figures show each chromosome on the x-axis and positon on the y-axis for three different simulated DHs. The total number of crossovers per each individual is shown above.

Generate Numeric Genotypes

The generation of numeric genotypes based on parental states is shown using the function SimGenotypes. To generate numeric genotypes based on their parental states, numeric designations can be made randomly or by using an actual numeric matrix from non-simulated parents. If the numeric matrix of the founders is provided, the number of columns in the paretnal matrix (i.e., number of genetic markers) must match the gamete matrix with parental assignments from SimGametes.

Data Processing Step

  • Subset out a set of markers equal to the number of markers from SimGametes if using a parental numeric matrix.
    • Ensure the parental genotype names in the parental matrix match the parental names in the gamete matrix.
#Load in the gametes
gametes<-fread("Data/MAGICparentalGenotypesWI-SS-MAGIC.csv")
  kable_styling(kable(gametes[1:10,][,1:10],caption = 'Parental Assignments in Simualted DHs')) 
Parental Assignments in Simualted DHs
Genotype M1_1 M1_2 M1_3 M1_4 M1_5 M1_6 M1_7 M1_8 M1_9
Geno1 B73 B73 B73 B73 B73 B73 B73 B73 B73
Geno2 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47
Geno3 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47 PHB47
Geno4 LH145 LH145 LH145 LH145 LH145 LH145 LH145 LH145 LH145
Geno5 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 LH145 LH145 LH145 LH145
Geno6 NKH8431 NKH8431 NKH8431 NKH8431 NKH8431 NKH8431 NKH8431 NKH8431 NKH8431
Geno7 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40
Geno8 LH145 LH145 LH145 LH145 LH145 LH145 LH145 LH145 LH145
Geno9 LH145 LH145 LH145 LH145 LH145 B73 B73 B73 B73
Geno10 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40 PHJ40
#Load in the actual genotype data
# num<-fread("~/Agriplex_SNP_Assay/4-SignificantMarkers_GWAS/Data/Exome50kHirsch/Hirsch_ExomeCapture5missImputed_100k.txt")
num<-fread("Data/InbredNumerical_G2F_5MAF_20k.txt")
  colnames(num)[1]<-'Inbred'
  
#MAF of the parents
source("~/Breeding_Functions/R/MAF.R")  
  parM<-num[341:346,]
#..MAF of the parents  
  parMAF<-MAF(x = parM)
    # parMAF$Chrom<-as.numeric(gsub("rs(.+):.*", "\\1", parMAF$Marker))
    parMAF$Chrom<-as.numeric(gsub("S(.+)_.*", "\\1", parMAF$Marker))

#A) Obtain actual genetic parameters from WI-SS-MAGIC DHs ####
    
#Sample 10k markers per chromosome
chr<-seq(1,10,1)
  ChromList<-list()
#..random 1,000 per chromosome
    for (i in 1:length(chr)){
      ChromList[[i]]<-sample_n(parMAF[Chrom==chr[i]],1000)
}
SampleDT<-rbindlist(ChromList)        
  # table(SampleDT$MAF)
#..Matching Markers between the random subset the numeric matrix
  markSamp<-unique(SampleDT$Marker)
    ParentNumSample<-cbind(parM[,c('Inbred')],parM[, ..markSamp])
fwrite(ParentNumSample,"Results/ParentalNumericalSubsetPHG.csv")

save.image("SavePoints/1-GeneticParameters.RData")
rm(list = setdiff(ls(),'gametes'))

Run SimGenotypes

  • The function SimGenotypes is run below using a numeric matrix that represents the genetic states of the six parents of the WI-SS-MAGIC Population.
#Parental marker file
par<-fread("Results/ParentalNumericalSubsetPHG.csv")
  par$Inbred<-c('B73','B84','LH145','PHB47','NKH8431','PHJ40')
parMat<-as.matrix(par[,c(2:10001)])
  rownames(parMat)<-par$Inbred
    kable_styling(kable(parMat[,1:10],caption = 'Numerical Assignments of Inbred Parents in the WI-SS-MAGIC Population.'))
Numerical Assignments of Inbred Parents in the WI-SS-MAGIC Population.
S1_108127968 S1_18899761 S1_199318883 S1_181853898 S1_179370851 S1_17220356 S1_56695382 S1_256457189 S1_302195059 S1_38222110
B73 1 0 1 1 1 1 1 0 1 1
B84 1 1 1 1 1 1 1 1 1 0
LH145 1 1 0 1 1 1 0 1 1 1
PHB47 1 0 1 1 1 1 1 1 0 1
NKH8431 0 1 1 0 0 0 0 1 1 1
PHJ40 0 1 1 0 0 0 0 1 0 1
#Run SimGenotypes to Produce Numeric Values
source("~/Simulations/G2Fsimulation/2-GenotypeGeneration/ConstructSimulatedGenotypesV2.R")
NumDT<-SimGenotypes(Gametes = gametes,ParentMat = parMat,nParents = NULL)
  fwrite(NumDT,"Results/NumericalDesignationsPHG.csv")
NumDT<-fread("Results/NumericalDesignationsPHG.csv")
  kable_styling(kable(NumDT[1:10,][,1:10],caption = 'Numerical Assignments of Simualted DHs'))
Numerical Assignments of Simualted DHs
Genotype M1_1 M1_2 M1_3 M1_4 M1_5 M1_6 M1_7 M1_8 M1_9
Geno1 1 0 1 1 1 1 1 0 1
Geno2 1 0 1 1 1 1 1 1 0
Geno3 1 0 1 1 1 1 1 1 0
Geno4 1 1 0 1 1 1 0 1 1
Geno5 0 1 1 0 0 1 0 1 1
Geno6 0 1 1 0 0 0 0 1 1
Geno7 0 1 1 0 0 0 0 1 0
Geno8 1 1 0 1 1 1 0 1 1
Geno9 1 1 0 1 1 1 1 0 1
Geno10 0 1 1 0 0 0 0 1 0

Population of Simulated Progeny

The actual genotypes of the parents used to generate the simulated progeny are included below in the MDS plot.

NumDT<-fread("Data/NumericalDesignationsPHG.csv")
  colnames(NumDT)[1]<-'Inbred'
ParentDT<-fread("Results/ParentalNumericalSubsetPHG.csv")
colnames(ParentDT)<-colnames(NumDT)
  ParentDT$Inbred<-c('B73','B84','LH145','PHB47','NKH8431','PHJ40')
    combineDT<-rbind(NumDT,ParentDT)
      combineDT<-combineDT[!Inbred %in% c('PHJ40')]
#.A) Distance matrix on the Individuals ####
  d<-dist(as.matrix(combineDT[,-c(1)]))
    fit <- isoMDS(as.matrix(d,d), k=2) # k is the number of dim
      mds_dt<-data.table(Inbred=c(combineDT$Inbred),x=fit$points[,1],y=fit$points[,2])

#..Column for Parents
  mds_dt[Inbred %in% c('B73','B84','LH145','PHB47','NKH8431','PHJ40'),Individuals:='Parents']
    mds_dt[!Inbred %in% c('B73','B84','LH145','PHB47','NKH8431','PHJ40'),Individuals:='Progeny']
  
#Save the results 
  fwrite(mds_dt,"Results/MDSsimulatedProgeny.csv")
mds_dt<-fread("Results/MDSsimulatedProgeny.csv")
#.B) Plot the Results ####
themeMDS<-function(x) theme(plot.title=element_text(hjust=0.5,face = "bold"),text = element_text(size=18),strip.text.x = element_text(size=18),axis.text.x = element_text(angle=0, hjust=1,size=18),axis.text.y = element_text(angle=0, hjust=1,size=18),axis.title.x =  element_text(angle = 0,size = 20), plot.subtitle = element_text(size = 18,face="italic"), axis.title.y =  element_text(angle = 0,vjust = .5,size = 20))  
    
#Plot    
ggplot(mds_dt,aes(x=x,y=y,color=Individuals)) + geom_point(size=2) + geom_point(data = mds_dt[Individuals=='Parents'],aes(x=x,y=y,color=Individuals),size=3)+
  ggtitle("MDS for Simulated MAGIC \n Double Haploids")+theme_bw()+ themeMDS()+
geom_label(data = mds_dt[Individuals=='Parents'],aes(label=Inbred))

  ggsave("Results/Simulated_MDS_Plot.png",width = 8,height = 8)  
    
#C) Plot MAF ####
source('~/Breeding_Functions/R/MAF.R')
#Load in the numerical file  
  NumDT<-fread("Results/NumericalDesignationsPHG.csv")
  colnames(NumDT)[1]<-'Inbred'
# Calculate MAF
MAFprogeny<-MAF(NumDT)
  ggplot(MAFprogeny,aes(x=MAF)) + geom_histogram(alpha=.4,color='black') +ggtitle('Minor Allele Frequency of the simulated Progeny')+theme_bw()

ggsave("Results/SimulatedMAF.png",width = 8,height = 5)

#save.image("SavePoints/3-Genotype_Plot.RData") 

Minor allele frequencies peak around \(\frac{1}{6}\), \(\frac{2}{6}\), and \(\frac{3}{6}\) due to the six possible foudners.

Assign Individuals to Environments

To assign genotypes randomly to environments, the function SimEnvironment is designed to assign a total number of ‘N’ individuals to ‘E’ environments given a desired number of replications per environment. The function has the ability to hand partially replicated experiments (i.e., only a given percentage of the genotype in an environment replicated) and incomplete experimental designs (i.e., not all genotypes in each environment). Incomplete experimental designs are constructed by assigning a given number of experimental units (EU) to each environment, and providing a desired level of replication and population size such that not all genotypes can be grown in each environment.

Finally, for partially replicated experiments, the genotypes replicated in each environment can either be assigned at random or by deliberatly balancing out the number of total observations for each genotype acoss the E number of environments.

Arguments

  1. NumEU: a numeric value corresponding to the overall number of EU to assign to each environment
  2. PopSize: an optional numeric value corresponding to the overall number of individuals to sample from when assigning genotypes to environments. Default is NULL.
  3. GenoName: an optional vector with length equal to PopSize that contains the genotype names to be assigned to the lines in the MET
  4. nEnv: a numeric value corresponding to the number of environments in the MET
  5. nRep: a numeric value corresponding to the number of replicates in each environment.
  6. percentRep: a numeric value between 0 and 1 corresponding to the percent replication of the last replicate.
  7. RepAssignment: either ‘Random’ (default) or ‘Balanced’ if the number of observation per genotype should be even across the total number of environments.

Partially Replicated and Incomplete Design Example

#Functions for simulation
source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/EnvironmentAssignment.R")

#2) Assign genotypes to Environment ####
    
#Approxiamte simualtion represenative of the Genomes 2 Fields data for 2020 and 2021 on a per testcross basis
EnvDT<-SimEnvironments(NumEU = 500,PopSize = 500,nEnv = 23,nRep = 2,percentRep = .25)
  fwrite(EnvDT,"Results/EnvironmentAssignments.csv")  
#..Save  
    EnvDT<-fread("Results/EnvironmentAssignments.csv")  
EnvDT$Rep<-as.factor(EnvDT$Rep)
  EnvDT<-EnvDT[order(EnvDT$Env,decreasing = T)]
    
#..A) Plot environmental Assignment ####
cbPallete<-c("#56B4E9","#E69F00", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#DDCC77", "#117733", "#332288","#44AA99", "#999933", "#882255","#999999","#D55E00", "#CC79A7","#661100", "#6699CC", "#888888","#999999")
themeEnvAssign<-function(x) theme(plot.title=element_text(hjust=0.5,size=18,face="bold"),strip.text.y = element_text(size=17),strip.text.x = element_text(size=16), axis.text.x = element_text(angle=90, hjust=1,size=3),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = "italic",size = 16),axis.title.x = element_text(size = 18),axis.title.y = element_text(size = 18))

#Plot
ggplot(EnvDT,aes(Genotype,Env)) +geom_tile(aes(x=Genotype,y=Env,fill=Rep)) +theme_bw()+themeEnvAssign()+ggtitle('Number of Individuals per Environment') + xlab('Genotype') + ylab("Environment")+scale_fill_manual(values = cbPallete)
ggsave("Results/SimulatedEnvAssignments.png",width = 15,height = 10)  

save.image('SavePoints/1-AssignEnvironments.RData')

The example above is incomplete as only 25% of the genotypes in a given environment are replicated. Furhturmore, the population size was set to 500 but the EUs were also set to 500. Therefore, only 375 genotypes could be grown in a given environment. This leads to the white spaces in the figure above as only 375 genotypes are randomly subset out for each environment.

Produce Genetic Values

Additive values are generated either from the previously simulated molecular markers or can be generated internally in the function. This is done by providing a marker matrix via the argument ‘X’ in the function SimPhenotypes. Otherwise, genetic markers are randomly generated internally. Additive values are determined based on a given additive effect at each loci, a, as described by Lande and Thompson, 1990. By default, additve effects of the QTL are simulated from a standard normal distribution: \(u_{QTL} \text~ N(0,1)\). In the SimPhenotypes function, the variance parameter can be modified with the argument ‘a’ but is set to 1 by default.

The recommendation is that the markers are coded as -1 and 1 based on the A matrix theory described in Endelman and Jannink, 2012 but is not required.

The SimPhenotypes works in the context of a single environment, or a MET. For a single environment, an error variance can be provided to simultaneously generate breeding values, genotypic values, and phenotypes. The error provided must be between 0 and 1 and thus determines a desired trait heritability if the goal is to simulate phenotypes. However, if a MET is desired, an error variance is provided as shown under the Simulate MET section. For a MET, marker effects are independently simulated for each environment to generate a genotype-by-environment (GxE) interaction variance. By default, the QTL (i.e., marker) associated with the traits is the same across all environments but the parameter can be changed via the argument ‘GxE’ as provided below.

Data Processing Step

#Function to calculate MAf
source("~/Breeding_Functions/R/MAF.R")

#Load in the data
gt<-fread("Data/NumericalDesignations.csv")

#..Convert to matrix
  gtMat<-as.matrix(gt[,c(2:10001)])
    gtMat[gtMat==0]<- -1
    
# Accounting for MAF in the Simulated WI-SS-MAGIC Population
MAF_gt<-MAF(gt)
#...acount for  MAF
    gtMAF2<-matrix(nrow=nrow(gtMat), ncol=ncol(gtMat))
for (i in 1:nrow(gtMat)){
  for (j in 1:ncol(gtMat)){
    gtMAF[i,j]<-gtMat[i,j]-2*MAF_gt[j,]$MAF
  }
  print(i/nrow(gtMat)*100)
}
ScaledNumericGenotypes<-cbind(data.table(Genotype=c(rownames(gtMAF))), as.data.table(gtMAF))  
  fwrite(ScaledNumericGenotypes,'Results/NumericGenotypeWI-SS-MAGICmaf.csv')

In the simulated MAGIC population, as expected, the progeny minor allele frequency will fall around \(\frac{1}{6}\), \(\frac{2}{6}\), or \(\frac{3}{6}\) due to the six potential founders possible at each locus.

SimPhenotypes for 1 Environment

  • a: By default, SimPhenotypes assumes additive effects are simulated from a variance of 1 with a mean of 0. To generate only a single additive value at each QTL, suggestions of potential values of ‘a’ are provided in [Alternative additive values for QTL can be found at Stich, 2009.
  1. nQTL: number of regions associated with the trait of interest
  2. NumIndiv: number of individuals to simulate if a marker matrix is not provided (default is 100).
  3. envDT: an optional environmental assignments from SimEnvironments if a MET is being simulated
  4. a: numeric value corresponding to the variance for simulating addtive effects. By default, a is 1 as effects are simulated from a mean of 0 with variance of 1.
  5. Effect: a numeric value for simulating the mean of the marker effects. By default, the mean is 0 as effects are simulated from a mean of 0 with variance of 1.
  6. X: an optional marker matrix (individuals x markers), recommended but notrequired to be coded as -1 and 1 with the rownames provided as genotypes.
  7. Poly: an optional numeric value between 0 and 1 for the magnitude of the background polygenic effect compared to the a value provided.
  8. Vg: an optional numeric variance to simulate non-additive genetic effects
  9. QTLnames: an optional argument to specificy the markers to use and must be equal in length to the number of markers specified by ‘nQTL’
  10. QTLallele: specifying if the largest value in the numeric matrix (i.e., 1 or -1) is the major or minor allele.
  11. GxE: ‘Effect’ (default) or ‘Marker’ for if GxE should be modeled by allowing unique QTL effect estimates per environment or unique QTL (i.e., different markers) and effect estimates
source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/SimulatePhenotype.R")

#3) Simulate Additive values ####
gt<-fread('Results/NumericGenotypeWI-SS-MAGICmaf.csv')
#..Convert to matrix
  gtMat<-as.matrix(gt[,-c(1)])
    rownames(gtMat)<-gt$Genotype
#Phenotypes from markers with 0 tester effect
#..GY
  GY<-SimPhenotypes(NumIndiv = 500,X = gtMat,
         nQTL =100,QTLallele = 'Major')
    GY$Phenotype<-'GY'
      GY$Position<-100
#..GM  
  GM<-SimPhenotypes(NumIndiv = 500,X = gtMat,
      nQTL = 40,QTLallele = 'Major')
GM$Phenotype<-'GM'
  GM$Position<-40

#..Flowering Time
  Flower<-SimPhenotypes(NumIndiv = 500,X = gtMat,nQTL = 30,QTLallele = 'Major')
Flower$Phenotype<-'Flower'
  Flower$Position<-30
#..Plant Height
  PH<-SimPhenotypes(NumIndiv = 500,X = gtMat,
                     nQTL =30,QTLallele = 'Major')
PH$Phenotype<-'PH'
  PH$Position<-30

#Combine the phenotypes
PhenoEx<-rbind(GY,GM,Flower,PH)      
  PhenoEx$Phenotype<-factor(as.factor(PhenoEx$Phenotype), levels = c('PH','Flower','GM','GY'))
#Save 
  fwrite(PhenoEx,"Results/SimulatedValues.csv")
PhenoEx<-fread("Results/SimulatedValues.csv")
#A) Plot
themeBV<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.x = element_text(size=18), axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = 'italic',size = 15))

#Plot
ggplot(PhenoEx,aes(BreedVal)) + geom_density(aes(color=Phenotype))+ ggtitle('Distribution of additive values across 4 Traits',subtitle = 'Number of Positions associated with the \n trait shown  below trait name')+theme_bw() + themeBV() +facet_wrap(~Phenotype+Position,scales = 'free',ncol = 2)+xlab('Simulated Additive Values')
  ggsave("Results/Simualted_BV_Position.png",width = 10,height = 8)

save.image("SavePoints/2-BVcreation.RData")

From above, the genetic variance associated with each trait is a function of the number of QTL associated with the trait of interest. This is demonstrated by simulating additive values for grain yield, assuming 100 QTLs and flowering time, assuming 30 QTLs. For grain yield, the range is betwen approximately -25 and 32 while flowering time ranges between -15 and 15.

Simulating QTL with different Effects

While a few QTL may generally exhibit the largest effects, many QTL with small effects contribute to quanitative variation in observed phenotypes To account for these genomic regions, SimPhenotypes has an argument called ‘Poly’, to model the polygenic background effect of QTL. Poly takes a single numeric value between 0 and 1, to represent the size of the polygenic background effect as a percent of the original additive effect (a) previoulsy specified.

The below examples builds upon the previous section by simulating genotypic values for a trait controlled by 100 regions with 9,900 additional regions having an effect 1% of that of the QTL effect.

source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/SimulatePhenotype.R")

#4) Simulate Additive values for METs ####
gt<-fread('Results/NumericGenotypeWI-SS-MAGICmaf.csv')
#..Convert to matrix
  gtMat<-as.matrix(gt[,c(2:10001)])
    rownames(gtMat)<-gt$Genotype
    
#Phenotype with 100 QTL and 9900 background QTL 
  GYpoly<-SimPhenotypes(NumIndiv = 500,X = gtMat,Poly = .01,
         nQTL =100,QTLallele = 'Major')
    GYpoly$Phenotype<-'GY'
      GYpoly$Position<-'Polygenic'
      fwrite(GYpoly,'Results/SimulatedValuesPoly.csv')
#Plot the polygenic and non-polygenic effect QTL
qtlDT<-fread("Results/SimulatedValues.csv")[Phenotype=='GY'][,-c('Position')]
qtlDT$Position<-'QTL'
#..load in the polygenic table      
    GYpoly<-fread('Results/SimulatedValuesPoly.csv')
      phenoDT<-rbind(GYpoly,qtlDT)
  
#Plot the results
themePoly<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.x = element_text(size=18), axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = 'italic',size = 15))

#Plot
ggplot(phenoDT,aes(BreedVal)) + geom_density(aes(color=Phenotype))+ ggtitle('Distribution of additive values with \n and without the polygenic effect',subtitle = 'Number of QTLs assumed to be 100 per trait plus a \n polygenic effect of 1% ')+theme_bw() + themePoly() +facet_wrap(~Phenotype+Position,scales = 'free',ncol = 2)+xlab('Simulated Additive Values')

  ggsave("Results/Simualted_BVpoly_Position.png",width = 10,height = 8)

Assuming that each of the additional markers has an effect of 1% of that of the QTL effect increases the range of the simulated additive values. This increases the available genetic variation compared to only simulating 30 or 100 QTL associated with the trait of interest.

SimPhenotypes for METs

While a single genetic model can be built for a group of genotypes, plant breeders often desire to understand the performance of their population over a target population of environments (TPE). Therefore, SimPhenotypes can accomadate multi-environment trials by providing an additional argument, ‘envDT.’ The argument takes the data.table produced from SimEnvironments and generates a unique genotypic or additive value per environment based on the specified genetic model. Generating unique marker effects per environments simualtes GxE.

The below example assume a genetic model where only 100 QTL are associated with the trait of interest and unique marker effects are generated for each environment. However, the marker effects are generated for the same 100 QTL in each environment.

source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/SimulatePhenotype.R")

#4) Simulate Additive values for METs ####
gt<-fread('Results/NumericGenotypeWI-SS-MAGICmaf.csv')
#..Convert to matrix
  gtMat<-as.matrix(gt[,c(2:10001)])
    rownames(gtMat)<-gt$Genotype

#.A) Simulate Additive Values for Multiple Environments ####   
eDT<-fread('Results/EnvironmentAssignments.csv')    
    eDTsub<-eDT[Env %in% c('Env1','Env2','Env3','Env4')]
GxEdt<-SimPhenotypes(NumIndiv = nrow(gtMat),envDT = eDTsub,X = gtMat,
         nQTL =100,QTLallele = 'Major')  
  fwrite(GxEdt,'Results/EnvironmentValuesMET.csv')
  GxEdt<-fread('Results/EnvironmentValuesMET.csv')
#..Show the environmental correlations
  wideGxE<-dcast.data.table(GxEdt,Genotype ~Env, value.var = 'BreedVal')
    envs<-unique(GxEdt$Env)
#Plot the environment correlations      
    ggcorrplot(cor(wideGxE[, ..envs],use = 'pairwise.complete.obs'), title = 'Genetic correlation among simulated environemnts \n assuming a genetic model with 100 positions',legend.title = 'Correlation')  

    ggsave("Results/ExampleEnvCorrelation.png",width = 8,height = 5)

No covariance among the environments is assumed. Therefore, the environments are expected to have a correlation around \(0\) for all pairs of environments as shown above.

Simulate MET

To simulate components of variance associated with phenotypic variation in a multi-environment trial (MET), the function SimMET is implemented. The function takes the result from SimEnvironment and SimPhenotypes along with a numeric value for the component of variance due to the environment \((V_e)\), withinplot error \((V_{error})\), and a mean \(u\). A block effect can also be assumed by treating the replication as a block via the argument \(V_{block}\). The genotype-by-environment interaction is obtained from the previous specification of random marker effects by environment as was accomplished using the function SimPhenoytpes. Changing the within plot error between a value of 0 and 1 will shift the within environment plot based trait heritability from 1 to 0 respectively. The result is simulated phenotypic values in each of the environemnts of interest. For the \(V_{error}\) argument, the random residual variance is generated from \(\frac{PEV}{h^2} *\mu_a\) as demonstrated by Endelman, 2011 in rrBLUP where the \(PEV\) is specified via \(V_{error}\) to acheive a desired trait heritability.

Example of a MET

  • Below assumes the \((V_e)\) is 75 (correspodnign to a standard deviation of 8.66), \((V_{error})\) is 0.80 \((h^2=0.20)\), and the mean for each environment is assumed to be 170. A total of 100 QTL are associated with the trait of interest as shown above.
#5) Simulate the assignments error, field effect, environment,  rep, and GxE ####
gyAdd <-fread("Results/EnvironmentValuesMET.csv")
  EnvDT<-fread("Results/EnvironmentAssignments.csv")[Env %in% c('Env1','Env2','Env3','Env4')]

#Function for Assigning individuals to environments in METs
source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/SimulateMET.R")
#..A) Simulate MET ####
  METdt<-SimMET(dt = EnvDT,gt = gyAdd,u = 170,Ve = 75,Verror = .80)
    fwrite(METdt,"Results/SimulatedMET.csv")
#..B) Plot the environments     
    METdt<-fread("Results/SimulatedMET.csv")
themeMET<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.x = element_text(size=18), axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = 'italic',size = 15))
  
#Plot
ggplot(METdt,aes(PhenoVal)) + geom_density(aes(color=Env))+ ggtitle('Distribution of Simualted phenotypes \n across 4 environments',
 subtitle = 'Grain yield shown assumed to be controlled by 100 positions with an additive variance of 40')+theme_bw() + themeMET() +facet_wrap(~Env,scales = 'free')+theme(legend.position = 'none')+xlab('Grain Yield (bu/Ac)')
  ggsave("Results/Simualted_Env_Distributions.png",width = 10,height = 8)
  
#Correlate the replications
repWide<-dcast.data.table(METdt,Genotype+Env~Rep,value.var = 'PhenoVal')  
  setnames(repWide,c('1','2'),c('One','Two'))
repWideClean<-repWide[!is.na(One) & !is.na(Two)]  
  repCors<-rbindlist(lapply(split(repWideClean,repWideClean$Env), function(x) {
        RepCors<-data.table(Env=c(unique(x$Env)),RepCor=c(round(cor(x$One,x$Two),digits = 2)))
    return(RepCors)})) 
  
#..Show the replicate correlations    
    kable_styling(kable(repCors,caption = 'Replicate Correlations'))
Replicate Correlations
Env RepCor
Env1 0.26
Env2 0.30
Env3 0.16
Env4 0.12

The replicate correlations are the result of purely the residual variance \((V_{error})\) simulated as no spatial affect are being simulated and block variance is set to 0.

METdt<-fread("Results/SimulatedMET.csv")

#C) Heritabilitiy of the phenotypes
  METdt$pheno<-'GY'
    METdt$tester<-'No_effect'

  metSub<-METdt[,c('Genotype','Env','Rep','pheno','tester','PhenoVal')]
  
#Format for stage 1 function
  setnames(metSub,c('Genotype','Env','Rep','pheno','PhenoVal'),c('pedigree','environment','rep','variable','value'))

source("~/Genomes2Fields/1-Stage1/Stage1Function.R")

#...Run Stage 1      
    S1results<-Stage1Analysis(dt.tidy = metSub,dt.covar = NULL,covars = NULL,SpatCor = .60,no_covar = NULL,DF = 30)  

#..i) Plot the Heritability ####
  h2DT<-rbindlist(S1results[[6]])
fwrite(h2DT,"Results/EvenErrorH2.csv")
themecustomH2<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.y = element_text(size=25),  axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 15))
cbPallete<-c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#CC3311","#009988")

#Load in the data
h2DT<-fread("Results/EvenErrorH2.csv")

#Plot maximum heritability by year      
ggplot(h2DT,aes(x=environment, y=H2_cullis,fill=environment))+geom_col(position = "dodge",aes(x=environment,y=H2_cullis,fill=environment),width = .5) +theme_bw() + scale_y_continuous(breaks = seq(0,100,10)) +ggtitle("Heritability for grain yield per simulated environment") +ylab("Heritability (%)") + xlab("Environment")+themecustomH2()+theme(plot.subtitle = element_text(face = "italic")) + scale_fill_manual(values = cbPallete)+ theme(legend.position = "bottom") 
    ggsave("Results/Simulated_Env_Heritability.png",width = 15, height = 10)  

The figures above show the variation in heritability and phenotypic values between the four simulated environments.

Unique Error Variances across Environments

The above example assumes one common error variance across environments. However, the error between envrionments is generally assumed to be independent. To accomadate a scenario where the goal is to simulate error variances unique to each environment, the function SimMET accepts a vector of error variances. This vector is sampled from for each environment to assign unique and independent error variances to each environment.

The below example assumes the potential random error variances range from 0.20 to 0.80 to demonstrate a trait heritability of 0.80 to 0.20. At random, one error variance is selected such that a random heritability is generated. Beyond generating random errors, the means of the environments can additionally be allowed to vary instead of providing a variance for the environmental effect. This is done by providing a vector of means via the ‘u’ argument. Below, the means of the environments can vary between 160 and 190, the equivalent of providing an environmental variance component equal to 82.67 for generating random environmental effects.

source("~/Genomes2Fields/1-Stage1/Stage1Function.R")
source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/SimulateMET.R")

#Load in the data
gyAdd <-fread("Results/EnvironmentValuesMET.csv")
  EnvDT<-fread("Results/EnvironmentAssignments.csv")[Env %in% c('Env1','Env2','Env3','Env4')]
ERROR<-seq(0.20,0.80,.01)
EnvVar<-var(seq(160,190,1))
EnvMeans<-mean(seq(160,190,1))

#3) Simulate MET with uneven Variances ####
METunevenErrorDT<-SimMET(dt = EnvDT,gt = gyAdd,u = EnvMeans,Ve = EnvVar,Verror = ERROR)
  fwrite(METunevenErrorDT,"Results/SimulatedMETunevenDT.csv")

#..A) Run Stage 1 ####
METunevenErrorDT$pheno<-'GY'
  METunevenErrorDT$tester<-'No_effect'
    metSubUneven<-METunevenErrorDT[,c('Genotype','Env','Rep','pheno','tester','PhenoVal')]
  
#Format for stage 1 function
  setnames(metSubUneven,c('Genotype','Env','Rep','pheno','PhenoVal'),c('pedigree','environment','rep','variable','value'))    
  
#Stage 1    
  S1resultsUnevenError<-Stage1Analysis(dt.tidy = metSubUneven,dt.covar = NULL,covars = NULL,SpatCor = .60,no_covar = NULL,DF = 30)  
#..Save results    
  h2UnevenDT<-rbindlist(S1resultsUnevenError[[6]])
    fwrite(h2UnevenDT,"Results/UnevenErrorH2.csv")
  VCUnevenDT<-rbindlist(S1resultsUnevenError[[1]])
    fwrite(VCUnevenDT,"Results/VCUnevenErrorH2.csv")   
  BLUEsUnevenDT<-rbindlist(S1resultsUnevenError[[4]])
    fwrite(BLUEsUnevenDT,"Results/BLUEsUnevenErrorH2.csv")    
#Load in the unequal error heritabilities, VC, and BLUES 
h2UnevenDT<-fread("Results/UnevenErrorH2.csv")
VCUnevenDT<-fread("Results/VCUnevenErrorH2.csv")
BLUEsUnevenDT<-fread("Results/BLUEsUnevenErrorH2.csv")
  BLUEsUnevenDT$pedigree<-as.factor(BLUEsUnevenDT$pedigree)
  BLUEsUnevenDT$environment<-as.factor(BLUEsUnevenDT$environment)

#Plot aesthetics
themecustomH2<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.y = element_text(size=25),  axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 15))

cbPallete<-c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#CC3311","#009988")

#..i) Plot the Heritability ####
    
#Plot maximum heritability by year      
ggplot(h2UnevenDT,aes(x=environment, y=H2_cullis,fill=environment))+geom_col(position = "dodge",aes(x=environment,y=H2_cullis,fill=environment),width = .5) +theme_bw() + scale_y_continuous(breaks = seq(0,100,10)) +ggtitle("Heritability for grain yield per simulated environment assuming unequal error variance") +ylab("Heritability (%)") + xlab("Environment")+themecustomH2()+theme(plot.subtitle = element_text(face = "italic")) +scale_fill_manual(values = cbPallete)+ theme(legend.position = "bottom") 
    ggsave("Results/Simulated_Env_HeritabilityUnequalVar.png",width = 15, height = 10)  
    
#..ii) VC for Stage 2 ####
asreml.options(trace=FALSE)      
  ModS2<-asreml(fixed = predicted.value ~ 1, random= ~pedigree +environment+pedigree:environment,
                           family=asr_gaussian(dispersion=1),weights = S2.weight, data = BLUEsUnevenDT)
## Online License checked out Sun Jul 10 19:11:02 2022
  vc<-as.data.table(summary(ModS2)$varcomp,keep.rownames = T)
  vc[rn=='units!R',rn:='Error']
    vc[rn=='Error',component:=mean(VCUnevenDT[id=='units!R']$component)]
#..format VC   
  vc$Proportion<-round(vc$component/sum(vc$component),digits = 4) 
   vc$component<-round(vc$component,digits = 2)
    vc$z.ratio<-round(vc$z.ratio,digits = 2)

#show table
  kable_styling(kable(vc[,c('rn','component','z.ratio','Proportion')],caption = 'Second Stage Variance Components'))
Second Stage Variance Components
rn component z.ratio Proportion
environment 46.70 1.22 0.1849
pedigree 0.00 NA 0.0000
pedigree:environment 136.91 23.47 0.5422
Error 68.92 NA 0.2729
save.image("SavePoints/3-METcreation.RData")

When comparing the range in the heritabilities between MET assuming equal compared to unequal variances, the range is heritability estimates is greater with the unequal variances as would be expected and commonly observed in METs.The zero genetic correlation between the environments helps explain why the main effect of the genotype is so small across the environments.

#C) Plot the environment distributions ####
BLUEsUnevenDT<-fread("Results/BLUEsUnevenErrorH2.csv")

#Plot aesthetics
themeValues<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.x = element_text(size=14), axis.text.x = element_text(angle=45, hjust=1,size=14),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = 'italic',size = 15))
  
#Plot 
ggplot(BLUEsUnevenDT,aes(predicted.value)) + geom_density(aes(color=environment))+ ggtitle('Distribution of simulated phenotype controlled by 100 QTL across 4 Environments',subtitle = 'Distribution across environments')+theme_bw() + themeValues()+ facet_grid(~environment,scales = 'free')+xlab('Phenotypic Value')
  ggsave("Results/Simulated_BLUEs.png",width = 15,height = 8)

From the figure above, the environment one to four have a mean of 180, 175, 170, 190. These values are in line with the vector of environmental averages provided as the values ranged between 160 and 190.

2 Stage Analysis Simulation

The below example demonstrates the utility of the simulation in the context of a 2 stage analysis using an experimental design that simulates the Genomes to Fields (G2F) Inisiative in 2020 and 2021. In this experiment, the WI-SS-MAGIC population is crosssed to three commerical testers ranging in maturity and the resulting testcross populations are primarily grown in environments highly adapted to the commerical tester. For the simulation described below, the tester effect is ignored and not incldued.

Environment Assignment

For the environmental assignment, the experiment was designed such that 375 genotypes would be grown in each environment and approximatley 17.4% of the experimental hybrids are replicated twice.This leads to a total of 440 EUs worth of hybrids. However, an additional 30 check hybrids are included to generate a total of 500 EUs per environment. Finally, the total number of observations for each hybrid is desired to be balanced across the total numebr of environments. This scenairo can be simulated using the function SimEnvironments by setting RepAssignment to ‘Balanced’.

source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/EnvironmentAssignment.R")

#1) Experimental Inbreds ####        
  EnvExpInbreds<-SimEnvironments(NumEU = 440,PopSize = 375,nEnv = 23,nRep = 2,percentRep =  0.174,RepAssignment = 'Balanced')

#2) Checks  ####
CheckGenos<-paste0('Check',seq(1,30,1))
  EnvChecks<-SimEnvironments(NumEU = 60,PopSize = 30,nEnv = 23,nRep = 2,percentRep = 1,GenoName = CheckGenos)
    fwrite(rbind(EnvExpInbreds,EnvChecks),"Results/G2Fsimulation/G2F_EnvAssignments.csv")
#Count the plots per hyrbid
EnvExpInbreds<-fread("Results/G2Fsimulation/G2F_EnvAssignments.csv")[!Genotype %like% 'Check']
EnvExpInbreds$Rep<-as.factor(EnvExpInbreds$Rep)
  countH<-EnvExpInbreds[,.N, by=.(Genotype)]
    hist(countH$N,main='Observations per Experimental Hybrid',xlab = 'Observations')

#3) Plot Environment Assignments ####        
    
#Aesthetics
  cbPallete<-c("#56B4E9","#E69F00", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#DDCC77", "#117733", "#332288","#44AA99", "#999933", "#882255","#999999","#D55E00", "#CC79A7","#661100", "#6699CC", "#888888","#999999")
    
themeEnvAssign<-function(x) theme(plot.title=element_text(hjust=0.5,size=18,face="bold"),strip.text.y = element_text(size=17),strip.text.x = element_text(size=16), axis.text.x = element_text(angle=90, hjust=1,size=3),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = "italic",size = 16),axis.title.x = element_text(size = 18),axis.title.y = element_text(size = 18))

#Plot
ggplot(EnvExpInbreds,aes(Genotype,Env)) +geom_tile(aes(x=Genotype,y=Env,fill=Rep)) +theme_bw()+themeEnvAssign()+ggtitle('Number of Experimental Hybrids per Environment') + xlab('Genotype') + ylab("Environment")+scale_fill_manual(values = cbPallete)
ggsave("Results/G2Fsimulation/G2F_SimulatedEnvAssignments.png",width = 15,height = 10)      

In the figure above, all genotypes are grown in each environment and are observed at equal frequency overall. Given the total number of environments, only a small number of hybrids were not 27 times.

Genetic Model

  • Genetic values of progeny are simulated for four traits composed of 100, 40, 30, and 30 genomic regions with additive effects set to the default parameters and assuming no polygenic effect. Phenotypes are simulated for 23 environments, the average number of environmnets across the testcross populations for the G2F experiment.
source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/SimulatePhenotype.R")

#Load in the Environmental Assignments
envG2F<-fread("Results/G2Fsimulation/G2F_EnvAssignments.csv")[Genotype %like% 'Geno']

#3) Simulate Additive values ####
gt<-fread('Results/NumericGenotypeWI-SS-MAGICmaf.csv')
#..Convert to matrix
  gtMat<-as.matrix(gt[,-c(1)])
    rownames(gtMat)<-gt$Genotype
    
#Phenotypes from markers with 0 tester effect
#..GY
  GYg2f<-SimPhenotypes(envDT = envG2F,X = gtMat,
            nQTL =100,QTLallele = 'Major')
    GYg2f$Phenotype<-'GY'
      GYg2f$Position<-100
#..GM  
  GMg2f<-SimPhenotypes(envDT = envG2F,X = gtMat,
            nQTL = 40,QTLallele = 'Major')
GMg2f$Phenotype<-'GM'
  GMg2f$Position<-40

#..Flowering Time
  Flowerg2f<-SimPhenotypes(envDT = envG2F,X = gtMat,
                nQTL = 30,QTLallele = 'Major')
Flowerg2f$Phenotype<-'Flower'
  Flowerg2f$Position<-30
#..Plant Height
  PHg2f<-SimPhenotypes(envDT = envG2F,X = gtMat,
              nQTL =30,QTLallele = 'Major')
PHg2f$Phenotype<-'PH'
  PHg2f$Position<-30

#Combine the phenotypes
PhenoExG2F<-rbind(GYg2f,GMg2f,Flowerg2f,PHg2f)      
  PhenoExG2F$Phenotype<-factor(as.factor(PhenoExG2F$Phenotype), levels = c('PH','Flower','GM','GY'))

#Obtain values for checks
  CheckList<-list()
phenoDT<-data.table(Phenotype=c('PH','Flower','GM','GY'), nQTL=c(30,30,40,100))  
  envG2Fchecks<-fread("Results/G2Fsimulation/G2F_EnvAssignments.csv")[Genotype %like% 'Check']
      for (i in 1:nrow(phenoDT)){
        checkPheno<-SimPhenotypes(envDT = envG2Fchecks,
              nQTL =phenoDT[i,]$nQTL)
        checkPheno[,`:=`(Phenotype=paste(phenoDT[i,]$Phenotype), Position=paste(phenoDT[i,]$nQTL))]
CheckList[[i]]<-checkPheno        
  }
CheckDTg2f<-rbindlist(CheckList)
  CheckDTg2f$Genotype<-gsub('Line','Check',CheckDTg2f$Genotype)
  
#Save 
  fwrite(rbind(PhenoExG2F,CheckDTg2f),"Results/G2Fsimulation/G2F_SimulatedValues.csv")
  PhenoExG2F<-fread("Results/G2Fsimulation/G2F_SimulatedValues.csv")[Genotype %like% 'Geno']

#4) Plot the Results ####  
phenoAes<-data.table(Phenotype=c('Plant Height','Flower Time','Grain Moisture','Grain Yield'), Position=c(30,30,40,100))
  kable_styling(kable(phenoAes,caption = 'Phenotype Breakdown'))
Phenotype Breakdown
Phenotype Position
Plant Height 30
Flower Time 30
Grain Moisture 40
Grain Yield 100
#..A) Plot ####
  themeValues<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.x = element_text(size=14), axis.text.x = element_text(angle=45, hjust=1,size=14),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = 'italic',size = 15))

#Plot
ggplot(PhenoExG2F,aes(BreedVal)) + geom_density(aes(color=Phenotype))+ ggtitle('Distribution of additive values across 4 Traits',subtitle = 'Distribution across environments')+theme_bw() + themeValues() +facet_grid(Phenotype~Env,scales = 'free')+xlab('Simulated Additive Values')
  ggsave("Results/G2Fsimulation/G2F_Simualted_BV_Position.png",width = 20,height = 8)

Generate MET

  • For the MET, the means of the phenotypes for grain yield (GY), grain moisture (GM), flowering time (Flower), and plant height (PH) were given a mean of 170, 20%, 1350 GDUs and 180cm respectively.
    • No tester effect is simulated so the mean for plant height is a value commonly observed for inbreds and is not reflective of a value commonly observed in hyrbid progeny.
source("~/Simulations/G2Fsimulation/3-PhenotypeSimulations/SimulateMET.R")
setwd("~/Simulations/G2Fsimulation/4-SimulationDemonstration/")

#Simulate the MET
gyAddG2F<-fread("Results/G2Fsimulation/G2F_SimulatedValues.csv")
envG2F<-fread("Results/G2Fsimulation/G2F_EnvAssignments.csv")
PhenoList<-list()
  ErrorDT<-rbind(data.table(Phenotype=c('GY'),Value=c(seq(.50,.90,.01))),
                 data.table(Phenotype=c('GM'),Value=c(seq(.40,.70,.01))),
                   data.table(Phenotype=c('Flower'),Value=c(seq(.30,.40,.01))),
                    data.table(Phenotype=c('PH'),Value=c(seq(.20,.40,.01))))
#Trait Means  
phenosDT<-data.table(Phenotype=c('GY','GM','Flower','PH'),
            Mean=c(mean(seq(140,200,1)),mean(seq(15,25,1)),mean(seq(1200,1400,1)),mean(seq(175,225,1))),
                EnvVar=c(var(seq(140,200,1)),var(seq(15,25,1)),var(seq(1200,1400,1)),var(seq(175,225,1))))
  
for (i in 1:nrow(phenosDT)){
  phenoSub<-gyAddG2F[Phenotype==phenosDT[i,]$Phenotype][,c('Genotype','Env','BreedVal','GenoVal')]
    METunevenErrorDT<-SimMET(dt = envG2F,gt = phenoSub,u = phenosDT[i,]$Mean,Ve = phenosDT[i,]$EnvVar,Verror = ErrorDT[Phenotype ==phenosDT[i,]$Phenotype]$Value)
      METunevenErrorDT$Phenotype<-phenosDT[i,]$Phenotype
#..save the results    
PhenoList[[i]]<-METunevenErrorDT    
}
 phenoG2F <-rbindlist(PhenoList)
  fwrite(phenoG2F,"Results/G2Fsimulation/G2F_SimulatedMETunevenDT.csv")
phenoG2F<-  fread("Results/G2Fsimulation/G2F_SimulatedMETunevenDT.csv")

phenoG2F[Genotype %like% 'Check',Check:='Check']
phenoG2F[!Genotype %like% 'Check',Check:='Experimental Hybrid']

#A) Plot the means
themeValues<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.x = element_text(size=14), axis.text.x = element_text(angle=45, hjust=1,size=12),axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = 'italic',size = 15))
cbPallete<-c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#CC3311","#009988")

#Plot
ggplot(phenoG2F,aes(PhenoVal)) + geom_histogram(aes(fill=Check),alpha=.4,color='black')+ ggtitle('Distribution of phenotypes  across 4 Traits',subtitle = 'Distribution across environments')+theme_bw() + themeValues() +facet_grid(Env~Phenotype,scales = 'free')+xlab('Simulated Phenoytpes')+scale_fill_manual(values = cbPallete)
  ggsave("Results/G2Fsimulation/G2F_Simulated_Phenos.png",width = 8,height = 15)

The figure above shows the within environment distribution for both the simulated 30 check lines and 370 experimental hybrids. Clearly the variation for the 370 experimental hybrids is much larger than that of the checks. Beyond the within environment difference between the checks and hybrids, there appears to be differences in the mean value of each environments per phenotype

2 Stage Analysis

As was conducted above, a two stage analysis, as described by Smith et al. (2001), Piepho et al. (2012), and Damesa et al (2017) can be conducted to handle the uniuque error variances associated with each environment and the differences in replication for each experimental hybrid within an environment.

The analysis works by first estimating BLUEs per environment and phenotype. During this stage, within environment heritability is calculated using the Cullis Method (Cullis et al., 2006) to account for the difference in reliability between the hybrids. Following, cross environment heritability and components of variance are estimated and BLUPs are calculated.

source("~/Genomes2Fields/1-Stage1/Stage1Function.R")
 phenoG2F <-fread("Results/G2Fsimulation/G2F_SimulatedMETunevenDT.csv")

#..A) Run Stage 1 ####
  phenoG2F$tester<-'No_effect'
    phenoG2FSub<-phenoG2F[,c('Genotype','Env','Rep','Phenotype','tester','PhenoVal')]
  
#Format for stage 1 function
  setnames(phenoG2FSub,c('Genotype','Env','Rep','Phenotype','PhenoVal'),c('pedigree','environment','rep','variable','value'))    

#Stage 1    
  S1resultsG2F<-Stage1Analysis(dt.tidy = phenoG2FSub,dt.covar = NULL,covars = NULL,SpatCor = .60,no_covar = NULL,DF = 30)  
#..Save results    
  h2G2F<-rbindlist(S1resultsG2F[[6]])
    fwrite(h2G2F,"Results/G2Fsimulation/H2_G2F.csv")
  VCG2F<-rbindlist(S1resultsG2F[[1]])
    fwrite(VCG2F,"Results/G2Fsimulation/VC_G2F.csv")   
  BLUEsG2F<-rbindlist(S1resultsG2F[[4]])
    fwrite(BLUEsG2F,"Results/G2Fsimulation/BLUEs_G2F.csv")
#Load in the data
h2G2F<-fread("Results/G2Fsimulation/H2_G2F.csv")
VCG2F<-fread("Results/G2Fsimulation/VC_G2F.csv")
BLUEsG2F<-fread("Results/G2Fsimulation/BLUEs_G2F.csv")

#Plot aesthetics
themecustomH2<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),strip.text.y = element_text(size=25),  axis.text.x = element_text(angle=45, hjust=1,size=15),axis.text.y = element_text(angle = 0, hjust = 1,size = 15))

cbPallete<-c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#CC3311","#009988")

#..i) Plot the Heritability ####
    h2Sub<-h2G2F[,.SD[which.max(H2_cullis)],by=.(environment,pheno)]

#Plot maximum heritability by year      
ggplot(h2Sub,aes(x=environment, y=H2_cullis,fill=pheno))+geom_col(position = "dodge",aes(x=environment,y=H2_cullis,fill=pheno),width = .5) +theme_bw() + scale_y_continuous(breaks = seq(0,100,10)) +ggtitle("Within Environment Heritability by Phenotype") +ylab("Heritability (%)") + xlab("Environment")+themecustomH2()+theme(plot.subtitle = element_text(face = "italic")) +scale_fill_manual(values = cbPallete)+ theme(legend.position = "bottom") +
  facet_grid(~pheno,scales = 'free')
    ggsave("Results/G2Fsimulation/HeritabilityEnv.png",width = 15, height = 10)  

The within environment heritability appears to dramatically vary across environments and at times can be 0 depending on the magnitude of the error variance assigned to the environment. As expected, grain yield on average has a lower heritability than the remaining traits.

#Load in the data
h2G2F<-fread("Results/G2Fsimulation/H2_G2F.csv")
VCG2F<-fread("Results/G2Fsimulation/VC_G2F.csv")
BLUEsG2F<-fread("Results/G2Fsimulation/BLUEs_G2F.csv")

#Plot the BLUEs
themePheno<-function(x) theme(plot.title=element_text(hjust=0.5),text = element_text(size=18),
                                  strip.text.x = element_text(size=18), axis.text.x = element_text(angle=45, hjust=1,size=15),
                                           axis.text.y = element_text(angle = 0, hjust = 1,size = 15),plot.subtitle = element_text(face = 'italic',size = 15))
#Box plots of Phenotypes
ggplot(BLUEsG2F,aes(x=environment,y=predicted.value)) + geom_boxplot(aes(fill=environment),color='black',alpha=.4,outlier.colour = 'red') + ggtitle('Distribution of simulated BLUEs',
  subtitle = '30, 30, 40, and 100 QTL assumed to be controlling each trait respectively \n Random environmental trait heritability of 60 to 70, 60 to 80, 30 to 60 and 10 to 50 respectively')+
    theme_bw() + themePheno() + facet_wrap(~pheno,scales = 'free',ncol = 2) + xlab('Simulated BLUEs')
ggsave("Results/G2Fsimulation/BLUEsBoxPlot.png",width = 15,height = 12)  


#B) VC for Stage 2 ####
  phenos<-unique(BLUEsG2F$pheno)
    VCList<-list()
for (i in 1:length(phenos)){
  bluesSub<-BLUEsG2F[pheno==phenos[i]]
    bluesSub$pedigree<-as.factor(bluesSub$pedigree) 
      bluesSub$environment<-as.factor(bluesSub$environment) 

#Run Model           
asreml.options(trace=FALSE)      
  ModS2g2f<-asreml(fixed = predicted.value ~ 1, random= ~pedigree +environment+pedigree:environment,
                           family=asr_gaussian(dispersion=1),weights = S2.weight, data = bluesSub)
  vc<-as.data.table(summary(ModS2g2f)$varcomp,keep.rownames = T)
  vc[rn=='units!R',rn:='Error']
    vc[rn=='Error',component:=mean(VCG2F[pheno==phenos[i] & id=='units!R']$component)]
#....Cullis estimated and entry-mean heritability
    PredS2G2f<-predict.asreml(ModS2g2f,classify = "pedigree",present = "pedigree")
      vc$h2_cullis<- (1 - ((PredS2G2f$avsed**2) / (2 * vc[rn=="pedigree"]$component)))* 100
#..format VC   
  vc$Proportion<-round(vc$component/sum(vc$component),digits = 4) 
   vc$component<-round(vc$component,digits = 2)
    vc$z.ratio<-round(vc$z.ratio,digits = 2)
    vc$pheno<-phenos[i]
VCList[[i]]<-vc
}
VCStage2DT<-rbindlist(VCList)
  fwrite(VCStage2DT,"Results/G2Fsimulation/VCStage2.csv")
  
#..i) Plot the variance components ####
cbPallete<-c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7","#CC6677", "#DDCC77", 
             "#117733", "#332288","#44AA99", "#999933", "#882255","#999999",
                      "#D55E00", "#CC79A7","#661100", "#6699CC", "#888888","#999999")
  
themeVarComp<-function(x) theme(plot.title=element_text(hjust=0.5,size=22,face = "bold"),strip.text.y = element_text(size=22), plot.subtitle = element_text(size = 20,face = "italic"), axis.text.x = element_text(angle=45, hjust=1,size=15),strip.text.x = element_text(size=17),axis.text.y = element_text(angle = 0, hjust = 1,size = 20),axis.title.x = element_text(size=18),axis.title.y = element_text(size=18))
  
#Update labels for figure
  VCStage2DT$rn<-gsub('pedigree','Hybrid',VCStage2DT$rn)
VCStage2DT$rn<-factor(levels = c("Error","environment","Hybrid:environment","Hybrid"),as.factor(VCStage2DT$rn))
  setnames(VCStage2DT,c('rn'),c('Components'))

#Plot
ggplot(VCStage2DT,aes(x=pheno,y=Proportion,fill=Components))+geom_bar(aes(x=pheno,y=Proportion,fill=Components),stat = "identity",alpha=.8,color="black")+ylab("Percent Variance")+theme(legend.position = "right")+ggtitle("Variance components by phenotype") + theme_bw()+themeVarComp() + scale_fill_manual(values = cbPallete)+xlab("Phenotype") + theme(legend.text = element_text(size = 13),legend.title = element_text(size=15),legend.position = 'bottom')
    ggsave("Results/G2Fsimulation/VC.phenos.png",width = 15,height = 10)  
  
#show table of variance components
  scroll_box(kable_styling(kable(VCStage2DT[,c('pheno','Components','component','z.ratio','Proportion')],caption = 'Second Stage Variance Components for Simulated MET Experiment')),height = '150px')
Second Stage Variance Components for Simulated MET Experiment
pheno Components component z.ratio Proportion
GY environment 283.59 3.31 0.2462
GY Hybrid 111.59 12.28 0.0969
GY Hybrid:environment 397.68 65.96 0.3452
GY Error 359.06 NA 0.3117
GM environment 11.29 3.27 0.0605
GM Hybrid 51.21 13.48 0.2744
GM Hybrid:environment 60.85 63.81 0.3261
GM Error 63.25 NA 0.3390
Flower environment 2934.08 3.32 0.9636
Flower Hybrid 50.56 13.83 0.0166
Flower Hybrid:environment 29.64 61.94 0.0097
Flower Error 30.78 NA 0.0101
PH environment 151.17 3.32 0.6029
PH Hybrid 54.70 13.94 0.2181
PH Hybrid:environment 22.25 60.95 0.0887
PH Error 22.63 NA 0.0903
#show table of H2 
h2Stage2<-VCStage2DT[,.SD[which.max(h2_cullis)],by=.(pheno)]  
h2Stage2$h2_cullis<-round(h2Stage2$h2_cullis,digits = 2)
  kable_styling(kable(h2Stage2[,c('pheno','h2_cullis')],caption = 'Second Stage Cullis heritability by phenotype'))
Second Stage Cullis heritability by phenotype
pheno h2_cullis
GY 86.46
GM 94.88
Flower 97.33
PH 98.10
save.image("SavePoints/6-G2F_2StageAnalysis.RData")

From above, the smallest hyrbid main effect additive is associated with grain yield, a trait controlled by 100 QTL. Alternatively, the traits controlled by 30 QTL, such as plant height and flowering time, have larger hyrbid main effects and smaller GxE relative to grain yield. Grain yield also appears to have the highest residual variance compared to the remaining traits. The final heritabilities are very large as the GxE term is divided by the total number of environments, 23, and the error term is divided by the total number of plots across the whole experiment.